---
id: synthesizer-generate-from-docs
title: Generate From Documents
sidebar_label: Generate From Documents
---

<head>
  <link
    rel="canonical"
    href="https://deepeval.com/docs/synthesizer-generate-from-docs"
  />
</head>

If your application is a Retrieval-Augmented Generation (RAG) system, generating Goldens from documents can be particularly useful, especially if you already have access to the **documents that make up your knowledge base**. By simply providing these documents, the Synthesizer will automatically handle generating the relevant contexts needed for synthesizing test Goldens.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/synthesize-from-docs.svg"
    style={{
      marginTop: "20px",
      marginBottom: "50px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

:::tip DID YOU KNOW?
The only difference between the `generate_goldens_from_docs()` and `generate_goldens_from_contexts()` method is `generate_goldens_from_docs()` involves an additional [context construction step.](#how-does-context-construction-work)
:::

## Generate Your Goldens

:::note
If you do not have an `OPENAI_API_KEY` and wish to synthesize goldens, you'll need to use [**custom embedding models**](/guides/guides-using-custom-embedding-models) in addition to custom LLMs.
:::

Before you begin, you must install `chromadb` as an additional dependency when generating from documents. The use of a vector database allows for efficient retrieval of chunks during context construction.

```python
pip install chromadb
```

Then, to generate synthetic `Golden`s from documents, simply provide a list of document paths:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)
```

There are **ONE** mandatory and **THREE** optional parameters when using the `generate_goldens_from_docs` method:

- `document_paths`: a list strings, representing the path to the documents from which contexts will be extracted from. Supported documents types include: `.txt`, `.docx`, and `.pdf`.
- [Optional] `include_expected_output`: a boolean which when set to `True`, will additionally generate an `expected_output` for each synthetic `Golden`. Defaulted to `True`.
- [Optional] `max_goldens_per_context`: the maximum number of goldens to be generated per context. Defaulted to 2.
- [Optional] `context_construction_config`: an instance of type `ContextConstructionConfig` that allows you to [customize the quality and attributes of contexts constructed](#customize-context-construction) from your documents. Defaulted to the default `ContextConstructionConfig` values.

:::info
The final maximum number of goldens to be generated is the `max_goldens_per_context` multiplied by the `max_contexts_per_document` as specified in the `context_construction_config`, and **NOT** simply `max_goldens_per_context`.
:::

## Customize Context Construction

You can customize the quality of contexts constructed from documents by providing a `ContextConstructionConfig` instance to the `generate_goldens_from_docs()` method at generation time.

```python
from deepeval.synthesizer.config import ContextConstructionConfig

...
synthesizer.generate_goldens_from_docs(
  document_paths=['example.txt', 'example.docx', 'example.pdf'],
  context_construction_config=ContextConstructionConfig()
)
```

There are **SEVEN** optional parameters when creating a `ContextConstructionConfig`:

- [Optional] `critic_model`: a string specifying which of OpenAI's GPT models to use to determine context `quality_score`s, **OR** [any custom LLM model](/docs/metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to the **model used in the `Synthesizer`**, else `gpt-4o` when initialized as a standalone instance.
- [Optional] `encoding`: the encoding to use to decode .txt files. Defaulted to auto detecting the encoding.
- [Optional] `max_contexts_per_document`: the maximum number of contexts to be generated per document. Defaulted to 3.
- [Optional] `min_contexts_per_document`: the minimum number of contexts to be generated per document. Defaulted to 1.
- [Optional] `max_context_length`: specifies the number of of text chunks to be generated per context (context length). Defaulted to 3.
- [Optional] `min_context_length`: specifies the minimum number of text chunks to be generated per context (context length). Defaulted to 1.
- [Optional] `chunk_size`: specifies the size of text chunks (in characters) to be considered during [document parsing](#synthesizer-generate-from-docs#document-parsing). Defaulted to 1024.
- [Optional] `chunk_overlap`: an int that determines the overlap size between consecutive text chunks during [document parsing](#synthesizer-generate-from-docs#document-parsing). Defaulted to 0.
- [Optional] `context_quality_threshold`: a float representing the minimum quality threshold for [context selection](synthesizer-generate-from-docs#context-selection). If the context quality is below threshold, the context will be rejected. Defaulted to `0.5`.
- [Optional] `context_similarity_threshold`: a float representing the minimum similarity score required for [context grouping](synthesizer-generate-from-docs#context-grouping). Contexts with similarity scores below this threshold will be rejected. Defaulted to `0.5`.
- [Optional] `max_retries`: an integer that specifies the number of times to retry context selection **OR** grouping if it does not meet the required quality **OR** similarity threshold. Defaulted to `3`.
- [Optional] `embedder`: a string specifying which of OpenAI's embedding models to during document parsing and context grouping, **OR** [any custom embedding model](/guides/guides-using-custom-embedding-models) of type `DeepEvalBaseEmbeddingModel`. Defaulted to 'text-embedding-3-small'.

:::caution
**Unlike other customizations where configurations to your `Synthesizer` generation pipeline is defined at point of instantiating a `Synthesizer`**, customizing context construction happens at the generation level because context construction is unique to the `generate_goldens_from_docs()` method.

To learn how to customize all other aspects of your generation pipeline, such as output formats, evolution complexity, [click here.](/docs/synthesizer-introduction#customize-your-generations)
:::

## How Does Context Construction Work?

The `generate_goldens_from_docs()` method has an additional context construction pipeline that precedes the [goldens generation pipeline](#synthesizer-introduction#how-does-it-work). This is because to generate goldens grounded in context, we first have to extract and construction groups of contexts found in provided documents.

The context construction pipeline consist of three main steps:

- **Document Parsing**: Split documents into smaller, manageable chunks.
- **Context Selection**: Select random chunks from the parsed, embedded documents.
- **Context Grouping**: Group chunks that are similar in semantics (using cosine similarity) to create groups of contexts that are meaningful enough for subsequent generation.

[Click here](#customize-context-construction) To learn how to customize every parameters used for the context construction pipeline.

:::info
In summary, the documents are first split into chunks and embedded to form a collection of nodes. Random nodes are then selected, and for each selected node, similar nodes are retrieved and grouped together to create contexts. These contexts are then used to generate synthetic goldens as described in previous sections.
:::

### Document Parsing

In the initial **document parsing** step, each provided document is parsed using an **token-based text splitter**. This means the `chunk_size` and `chunk_overlap` parameters do not guarantee exact text chunk sizes. This approach ensures text chunks are meaningful and coherent, but might lead to variations in the expected size of each `context`.

These text chunks are then embedded by the `embedder` and stored in a vector database for subsequent selection and grouping.

:::caution
The synthesizer will raise an error if `chunk_size` is too large to generate n=`max_contexts_per_document` unique contexts.
:::

### Context Selection

In the **context selection** step, random nodes are selected from the vector database that contains the previously indexed nodes. Each time a node is selected, it is subject to filtering. This is because chunked contexts can result in trivial or undesirable content, such as a series of white spaces or unwanted characters from document structures, which is why filtering is important to ensure subsequently generated goldens are meaningful, relevant, and coherent.

Each chunk is quality scored (0-1) by an LLM (the `critic_model`) based based on the following criteria:

- **Clarity**: How clear and understandable the information is.
- **Depth**: The level of detail and insight provided.
- **Structure**: How well-organized and logical the content is.
- **Relevance**: How closely the content relates to the main topic.

If the quality score is still lower than the `context_quality_threshold` after `max_retries`, the context with the highest quality score will be used. Although this means that you might find context that have failed the filtering process being used, but you will be guaranteed to have context to be used for grouping.

:::note
The `critic_model` in the context construction pipeline can be different from the one used in the [`FiltrationConfig` of the generation pipeline](/docs/synthesizer-introduction#filteration-quality).
:::

### Context Grouping

In the final **context grouping** step, each previously selected nodes are grouped with `max_context_length` other nodes with a cosine similarity score higher than the `context_similarity_threshold`. This ensure that each context is coherent for subsequent generation to happen smoothly.

Similar to the context selection step, if the cosine similarity is still lower than the `context_similarity_threshold` after `max_retries`, the context with the highest similarity score will be used. Although this means that you might find context that have failed the filtering process being used, but you will be guaranteed to have context groups to be used for generation.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "start",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/filtering_context.svg"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>
